Handle replication requests on a dedicated replication thread. This removes a possible goroutine-exhaustion attack vector and cleanly handles shutdown.
Code Review
This pull request introduces several improvements to resource management, security, and the replication protocol. Key changes include adding defer Close() calls for database handles and network streams, implementing a 16MB maximum message size limit in CBOR deserialization to prevent potential DoS, and refactoring ledger replication to use a single-worker queue instead of unbounded goroutines. Additionally, the FGPTxSystem was updated with stricter configuration validation and sanity checks for chain tips. Feedback suggests that the 16MB message limit might be too restrictive for the default replication block count, the replication worker queue capacity could be increased to avoid frequent busy responses, and the block slice allocation in the recovery logic should be pre-allocated for better performance.
```go
	return append(lengthBytes[:bytesWritten], data...), nil
}

const maxMsgSize = 16 * 1024 * 1024 // 16 MB
```
The maxMsgSize of 16MB might be too restrictive when combined with the default maxReturnBlocks of 1000 in ledger replication. If blocks are larger than ~16KB on average, the resulting LedgerReplicationResponse will exceed this limit and be rejected by the receiver. Consider either increasing this limit or ensuring that the replication logic can handle splitting large responses.
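One way to handle the splitting suggested above can be sketched as follows; `capBlocksBySize` is a hypothetical helper, not part of the PR:

```go
package main

import "fmt"

// capBlocksBySize sketches one way to keep a LedgerReplicationResponse under
// the receiver's message-size limit: given the encoded size of each fetched
// block, it returns how many leading blocks fit within maxBytes, so the
// remainder can be served by a follow-up request.
func capBlocksBySize(blockSizes []int, maxBytes int) int {
	total := 0
	for i, s := range blockSizes {
		if total+s > maxBytes {
			return i
		}
		total += s
	}
	return len(blockSizes)
}

func main() {
	// With a 16-byte budget, only the first two 6-byte blocks fit.
	fmt.Println(capBlocksBySize([]int{6, 6, 6}, 16))
}
```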
```go
func (n *Node) processReplicationRequest(ctx context.Context, lr *replication.LedgerReplicationRequest) {
	startBlock := lr.BeginBlockNumber
	blocks := make([]*types.Block, 0)
```
```diff
 	select {
 	case n.replicationCh <- replicationRequest{ctx: ctx, req: lr}:
 		// worker will process
 	default:
 		resp := &replication.LedgerReplicationResponse{
-			UUID:             lr.UUID,
-			Status:           replication.Ok,
-			Blocks:           blocks,
-			FirstBlockNumber: firstFetchedBlockNumber,
-			LastBlockNumber:  lastFetchedBlockNumber,
+			UUID:    lr.UUID,
+			Status:  replication.Busy,
+			Message: "Too many concurrent replication requests",
 		}
 		return n.sendLedgerReplicationResponse(ctx, resp, lr.NodeID)
 	}
```
The replicationCh has a capacity of 1, meaning only one replication request can be queued while another is being processed. In a network with many nodes, this might lead to frequent Busy responses during recovery. Consider increasing the channel capacity or implementing a more flexible worker pool if high concurrency is expected.
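The non-blocking enqueue that produces the Busy response can be sketched as below; the buffer capacity of 4 is illustrative, not the PR's value:

```go
package main

import "fmt"

// tryEnqueue mirrors the PR's non-blocking send: it returns false when the
// queue is full, which maps to a Busy response. A deeper buffer (and, if
// needed, several workers draining it) would make Busy rarer during recovery.
func tryEnqueue(ch chan int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false
	}
}

func main() {
	ch := make(chan int, 4) // illustrative capacity; the PR uses 1
	accepted, busy := 0, 0
	for i := 0; i < 5; i++ {
		if tryEnqueue(ch, i) {
			accepted++
		} else {
			busy++
		}
	}
	fmt.Printf("accepted=%d busy=%d\n", accepted, busy)
}
```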
```go
	if stream != nil {
		stream.Close()
	}
}()
```
Issue #2 mentioned a second bug alongside the stream leak: SendMsgs still returns nil at the end of the function (line 140), discarding any resErr accumulated when serialization of an individual message fails. It should be `return resErr` so the caller sees partial-failure errors.
```go
	if tip.Height < block.Height {
		return nil, fmt.Errorf("proposed block height %d is ahead of tip height %d", block.Height, tip.Height)
	}
	if tip.Height-block.Height < s.dFG-1 {
```
Issue #5's Related note points out that FollowerVerify is missing two checks that LeaderPropose performs:
- No check that `block.Height > s.committedStateHeight`: a follower can accept a proposal at or below the last finalized height.
- No fork-safety check via `GetChainTips`: `LeaderPropose` verifies no competing fork is visible at the candidate height, but a follower does not repeat that verification against its own PoW view.

Both weaken safety guarantees vs. what the leader enforces. Worth either adding here or tracking as a follow-up.
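The two missing checks could look roughly like the sketch below; the flattened signature (heights passed in directly instead of `s.committedStateHeight` and `GetChainTips`) and the exact fork rule are assumptions about what LeaderPropose enforces:

```go
package main

import "fmt"

// followerVerifyHeight is a hypothetical sketch of the two checks the review
// says FollowerVerify is missing. committedStateHeight and the competing tip
// heights stand in for s.committedStateHeight and the result of GetChainTips;
// the fork rule shown (reject if any competing tip reaches the proposal
// height) is an assumption, not the PR's verified logic.
func followerVerifyHeight(blockHeight, committedStateHeight uint64, competingTipHeights []uint64) error {
	// Check 1: the proposal must extend past the last finalized height.
	if blockHeight <= committedStateHeight {
		return fmt.Errorf("proposal height %d is not above committed height %d", blockHeight, committedStateHeight)
	}
	// Check 2: no competing fork may already be visible at the candidate height.
	for _, h := range competingTipHeights {
		if h >= blockHeight {
			return fmt.Errorf("competing tip at height %d reaches proposal height %d", h, blockHeight)
		}
	}
	return nil
}

func main() {
	fmt.Println(followerVerifyHeight(10, 10, nil))          // rejected: at committed height
	fmt.Println(followerVerifyHeight(11, 10, []uint64{9}))  // accepted: no competing tip that high
}
```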
Fixes #1 #2 #3 #4 #5 #7 #8 #10 #12
Requires unicity-node branch "bft-snapshots".